Masked LM training: the puzzling part of the output-layer computation

from Parts of BERT I still don't fully understand

https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270

The Masked LM section of this article lists three steps:

Adding a classification layer on top of the encoder output.

Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.

Calculating the probability of each word in the vocabulary with softmax.

It's the second step that puzzles me.
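A minimal NumPy sketch of steps 2 and 3 from that list, mainly to make the shapes explicit. The sizes and the random `embedding_table` are made up for illustration, and the classification layer from step 1 is left out; this is not BERT's code.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size, num_masked = 10, 4, 3  # toy sizes (assumption)

# Stand-ins for trained values: one row of embedding_table per vocabulary token.
embedding_table = rng.normal(size=(vocab_size, hidden_size))
encoder_output = rng.normal(size=(num_masked, hidden_size))

# Step 2: multiply by the (transposed) embedding matrix.
# [num_masked, hidden] @ [hidden, vocab] -> [num_masked, vocab]
logits = encoder_output @ embedding_table.T

# Step 3: softmax over the vocabulary axis (shifted by the max for stability).
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

print(logits.shape)        # (3, 10)
print(probs.sum(axis=-1))  # each row sums to 1 (up to float rounding)
```

Each logit is just the dot product between a hidden vector and one token's embedding, so the matrix product scores every vocabulary entry at once.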

https://qiita.com/ta2bonn/items/71a1a55b99059f350365

Luckily someone has already read this closely, so I'm referring to their write-up as well:

https://qiita.com/ta2bonn/items/71a1a55b99059f350365#pre-training_masked-lm_仕組み

The explanation of the classification layer,

the multiplication with the embedding matrix, and so on:

I don't really understand what these mean.

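The paper should spell this out properly, so let's read it.

Specifically the section describing the task.

Read it and still didn't get it.

Now looking at the implementation.

Sure enough, the table returned by model.get_embedding_table() is matmul'ed in the final layer of the masked LM head.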

output_weights is the very same matrix as the input embedding.

The output side does have its own dedicated bias, though.

The input side has none.
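To make the sharing concrete, here is a small NumPy sketch. `TiedEmbedding` is a hypothetical class written for this note, not something in the BERT repo; it only illustrates one table being used for both the input lookup and the output projection, with a bias on the output side alone.

```python
import numpy as np

class TiedEmbedding:
    """Toy illustration: one table shared by input lookup and output logits."""

    def __init__(self, vocab_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        # The only weight matrix; used on both the input and the output side.
        self.table = rng.normal(scale=0.02, size=(vocab_size, hidden_size))
        # Output-only bias, one entry per token, initialised to zero.
        self.output_bias = np.zeros(vocab_size)

    def embed(self, token_ids):
        # Input side: plain row lookup, no bias.
        return self.table[token_ids]

    def logits(self, hidden):
        # Output side: hidden @ table^T, then add the bias.
        return hidden @ self.table.T + self.output_bias

    def num_params(self):
        return self.table.size + self.output_bias.size

tied = TiedEmbedding(vocab_size=10, hidden_size=4)
vectors = tied.embed(np.array([1, 5]))   # shape (2, 4)
scores = tied.logits(vectors)            # shape (2, 10)
print(vectors.shape, scores.shape, tied.num_params())  # (2, 4) (2, 10) 50
```

An untied head would need a second vocab_size x hidden_size matrix; here the only extra parameters on the output side are the bias entries.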

Call site:

code:py
 (masked_lm_loss, masked_lm_example_loss,
  masked_lm_log_probs) = get_masked_lm_output(
      bert_config,
      model.get_sequence_output(),   # [batch_size, seq_length, hidden_size]
      model.get_embedding_table(),   # passed in as output_weights
      masked_lm_positions,
      masked_lm_ids, masked_lm_weights,
  )

The third argument, output_weights, receives the embedding table.

code:py
 def get_masked_lm_output(bert_config, input_tensor, output_weights, positions,
                          label_ids, label_weights):
   """Get loss and log probs for the masked LM."""
   # Keep only the hidden vectors at the masked positions.
   input_tensor = gather_indexes(input_tensor, positions)
   with tf.variable_scope("cls/predictions"):
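`gather_indexes` is not shown in these excerpts. As I understand it, it picks out the hidden vectors at the masked positions of each sequence. A rough NumPy equivalent, where the name `gather_indexes_np` and the toy shapes are my own:

```python
import numpy as np

def gather_indexes_np(sequence_tensor, positions):
    """Select vectors at `positions` from a [batch, seq_len, width] array.

    positions has shape [batch, num_predictions]; the result has shape
    [batch * num_predictions, width].
    """
    batch_size, seq_length, width = sequence_tensor.shape
    # Offset of each example's first token in the flattened array.
    flat_offsets = (np.arange(batch_size) * seq_length).reshape(-1, 1)
    flat_positions = (positions + flat_offsets).reshape(-1)
    flat_sequence = sequence_tensor.reshape(batch_size * seq_length, width)
    return flat_sequence[flat_positions]

# Toy input: batch of 2 sequences, 5 tokens each, width 3.
seq = np.arange(2 * 5 * 3).reshape(2, 5, 3)
pos = np.array([[1, 3], [0, 4]])
picked = gather_indexes_np(seq, pos)
print(picked.shape)  # (4, 3)
```

After this step the batch and prediction axes are merged, which is why the labels and weights are later flattened to one dimension as well.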

About this transform:

hidden_act is gelu

https://github.com/google-research/bert/blob/bee6030e31e42a9394ac567da170a89a98d2062f/modeling.py#L40

So the order is dense, gelu, layer norm.

code:py
     # We apply one more non-linear transformation before the output layer.
     # This matrix is not used after pre-training.
     with tf.variable_scope("transform"):
       input_tensor = tf.layers.dense(
           input_tensor,
           units=bert_config.hidden_size,
           activation=modeling.get_activation(bert_config.hidden_act),
           kernel_initializer=modeling.create_initializer(
               bert_config.initializer_range))
       input_tensor = modeling.layer_norm(input_tensor)
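The same dense, gelu, layer norm sequence as a NumPy sketch. The weights are random stand-ins, and the learned scale and shift that a layer norm normally carries are omitted to keep it short. `gelu` here uses the common tanh approximation, which is an assumption on my part; the exact form in modeling.py at the linked commit may differ.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (assumed form, see note above)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-12):
    # Normalise each vector over its last axis (no learned gamma/beta here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def transform(x, kernel, bias):
    # dense -> gelu -> layer norm
    return layer_norm(gelu(x @ kernel + bias))

rng = np.random.default_rng(0)
hidden_size = 8
x = rng.normal(size=(3, hidden_size))
kernel = rng.normal(scale=0.02, size=(hidden_size, hidden_size))
bias = np.zeros(hidden_size)

out = transform(x, kernel, bias)
print(out.shape)  # (3, 8)
```

The output keeps the hidden size, so it can be multiplied by the transposed embedding table in the next step.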

Here it is.

output_weights is the same matrix as the input embeddings.

Only the output side has a bias.

code:py
     # The output weights are the same as the input embeddings, but there is
     # an output-only bias for each token.
     output_bias = tf.get_variable(
         "output_bias",
         shape=[bert_config.vocab_size],
         initializer=tf.zeros_initializer())
     # [num_predictions, hidden_size] x [vocab_size, hidden_size]^T
     logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
     logits = tf.nn.bias_add(logits, output_bias)
     log_probs = tf.nn.log_softmax(logits, axis=-1)
     label_ids = tf.reshape(label_ids, [-1])
     label_weights = tf.reshape(label_weights, [-1])
     one_hot_labels = tf.one_hot(
         label_ids, depth=bert_config.vocab_size, dtype=tf.float32)
     # The positions tensor might be zero-padded (if the sequence is too
     # short to have the maximum number of predictions). The label_weights
     # tensor has a value of 1.0 for every real prediction and 0.0 for the
     # padding predictions.
     per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=-1)
     numerator = tf.reduce_sum(label_weights * per_example_loss)
     denominator = tf.reduce_sum(label_weights) + 1e-5
     loss = numerator / denominator
   return (loss, per_example_loss, log_probs)
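The loss at the end is a weighted mean of the per-prediction negative log-likelihoods, with padded predictions weighted 0. A NumPy sketch with made-up logits and labels (the function name `masked_lm_loss` is mine):

```python
import numpy as np

def masked_lm_loss(logits, label_ids, label_weights):
    """Weighted mean negative log-likelihood over prediction slots."""
    # log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Picking the label's log-prob equals sum(log_probs * one_hot_labels).
    per_example_loss = -log_probs[np.arange(len(label_ids)), label_ids]
    numerator = (label_weights * per_example_loss).sum()
    denominator = label_weights.sum() + 1e-5
    return numerator / denominator, per_example_loss

# Three prediction slots over a 4-token vocabulary; the last slot is padding.
logits = np.array([[2.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 3.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0]])
label_ids = np.array([0, 2, 0])
label_weights = np.array([1.0, 1.0, 0.0])

loss, per_example = masked_lm_loss(logits, label_ids, label_weights)
print(loss, per_example)
```

The padded slot still gets a per-example loss, but its weight of 0.0 keeps it out of the average; the 1e-5 only guards against dividing by zero when there are no real predictions.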

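In the end, the article above is about the only place that mentions this part.

It isn't even in the paper; you only find out by reading the implementation.

Is this such a common operation that it isn't worth mentioning?

Then again, input embeddings often go unexplained too.

My own intuition for those is roughly this:

a one-hot vector alone is obviously hard to work with, so you multiply it by some matrix, train that matrix, and let it act as the embedding.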
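That intuition can be checked directly: multiplying a one-hot vector by a matrix just selects one row, which is why an embedding layer is usually implemented as a table lookup. Toy sizes and values, chosen for illustration:

```python
import numpy as np

vocab_size, hidden_size = 6, 3
# Stand-in for a trained embedding matrix: row i is the vector for token i.
embedding_table = np.arange(vocab_size * hidden_size, dtype=float).reshape(
    vocab_size, hidden_size)

token_id = 4
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

via_matmul = one_hot @ embedding_table   # one-hot vector times the matrix
via_lookup = embedding_table[token_id]   # direct row lookup

print(via_matmul)  # [12. 13. 14.]
print(via_lookup)  # [12. 13. 14.]
```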

https://medium.com/@_init_/why-bert-has-3-embedding-layers-and-their-implementation-details-9c261108e28a

This one explains the embedding table.

https://github.com/google-research/bert/issues/47

Finally found it.

This is fairly standard in terms of training a language model. For one thing, OpenAI GPT also did it, and we wanted to make our model exactly comparable to theirs. But in general it's a way of significantly reducing the number of parameters while improving the results (or at least, it improves results on small LM tasks like Penn TreeBank. It probably doesn't improve results here, but it doesn't hurt and makes the number of parameters much smaller).

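In my own words: this is quite standard in LM training, and OpenAI GPT (GPT-1) does it too.

They did the same so that the two models would be directly comparable.

More generally, it is a way to cut the number of parameters substantially while improving results

(at least on small LM tasks such as Penn TreeBank).

It probably doesn't improve results here, but it doesn't hurt,

and it makes the number of parameters much smaller.

So the technique itself is known to improve results on small LM tasks.

For BERT it may not improve results, but sharing the weights with the input embedding cuts the parameter count substantially.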

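How much is "substantially"? → One input (token) embedding table's worth, so vocab_size * hidden_size. For BERT-Base (vocab_size 30522, hidden_size 768) that is about 23.4M parameters.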
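As a back-of-the-envelope check: the saving equals the size of the token embedding table. The sketch below uses the published BERT-Base sizes and the roughly 110M total parameters quoted in the paper; the 110M figure is approximate, so the percentage is only indicative.

```python
# BERT-Base sizes (assumed here from the released config / paper).
vocab_size = 30522
hidden_size = 768
approx_total_params = 110_000_000  # rough figure, not an exact count

# An untied output layer would need its own [vocab_size, hidden_size] matrix.
saved_params = vocab_size * hidden_size
share_of_total = saved_params / approx_total_params

print(saved_params)              # 23440896
print(f"{share_of_total:.1%}")   # 21.3%
```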

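This technique probably has a name (it is usually referred to as weight tying, or tied input/output embeddings), and as pointed out above, I'd like to track down examples of models that use it.

Following up on the GPT paper should do.

But it's not urgent, so whenever I feel like it...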